Multivariate Data Analysis in Machine Learning
Linear Regression
We use least squares to estimate the parameters of a linear model: $\hat{\beta} = (Z'Z)^{-1}Z'y$, where $Z$ is the design matrix (the data matrix $X$ with a column of 1s prepended) and $y$ is the response vector. The residuals (the deviations) are $\hat{\epsilon} = y - Z\hat{\beta}$, and the residual sum of squares is $\mathrm{RSS} = \hat{\epsilon}'\hat{\epsilon}$. These identities are checked numerically in the sketch after the list below.
- It comes from $\min \sum_i (y_i - \hat{y}_i)^2$, where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ in the simple one-predictor case. The LS solution is unbiased: $E(\hat{\beta}) = \beta$ with $\mathrm{Cov}(\hat{\beta}) = \sigma^2(Z'Z)^{-1}$, and $E(\hat{\epsilon}) = 0$ with $\mathrm{Cov}(\hat{\epsilon}) = \sigma^2(I - H)$.
- We also define the hat matrix $H = Z(Z'Z)^{-1}Z'$; $H$ is the orthogonal projector onto the column space of $Z$. Then $\hat{y} = Z\hat{\beta} = Hy$.
- We also have $Z'(I - H) = 0$, so the residuals can be rewritten as $\hat{\epsilon} = y - \hat{y} = (I - H)y$, satisfying $Z'\hat{\epsilon} = 0$ and $\hat{y}'\hat{\epsilon} = 0$.
- Now $\mathrm{RSS} = \hat{\epsilon}'\hat{\epsilon} = y'(I - H)y = y'y - y'H'y = y'y - y'Z\hat{\beta}$, using that $I - H$ is symmetric and idempotent.
- Since $\hat{y}'\hat{\epsilon} = 0$, the cross terms vanish in $y'y = (\hat{y} + \hat{\epsilon})'(\hat{y} + \hat{\epsilon}) = \hat{y}'\hat{y} + \hat{\epsilon}'\hat{\epsilon}$, so $\mathrm{TSS} = y'y - n\bar{y}^2 = \hat{y}'\hat{y} - n\bar{y}^2 + \hat{\epsilon}'\hat{\epsilon}$, i.e. $\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\epsilon}_i^2$.
- With $R^2$ (the coefficient of determination) defined as $R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$, we get $R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$.
- If $Z$ is not of full rank, then $\hat{\beta} = (Z'Z)^{-}Z'y$ is a solution, where $(Z'Z)^{-}$ denotes a generalized inverse of $Z'Z$ (the solution is no longer unique).
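A minimal NumPy sketch of the identities above on synthetic data (the data and variable names are illustrative; `np.linalg.pinv` stands in for the generalized inverse $(Z'Z)^{-}$ from the last bullet, so the same code also covers rank-deficient $Z$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))                                   # synthetic predictors
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=n)

# Design matrix Z: X with a column of 1s prepended
Z = np.column_stack([np.ones(n), X])

# beta_hat = (Z'Z)^- Z'y; pinv gives a generalized inverse
beta_hat = np.linalg.pinv(Z.T @ Z) @ Z.T @ y

# Hat matrix H = Z (Z'Z)^- Z', the orthogonal projector onto col(Z)
H = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T
y_hat = H @ y                          # same as Z @ beta_hat
eps_hat = y - y_hat                    # residuals, (I - H) y

# Orthogonality of residuals: Z' eps_hat = 0 and y_hat' eps_hat = 0
assert np.allclose(Z.T @ eps_hat, 0.0, atol=1e-8)
assert np.allclose(y_hat @ eps_hat, 0.0, atol=1e-8)

# Decomposition TSS = ESS + RSS, and R^2 from it
RSS = eps_hat @ eps_hat
TSS = np.sum((y - y.mean()) ** 2)
assert np.isclose(TSS, np.sum((y_hat - y.mean()) ** 2) + RSS)
print("beta_hat:", beta_hat, "R^2:", 1 - RSS / TSS)
```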
Principal Component Analysis
Principal components are commonly used to support a subsequent regression analysis or cluster analysis.
Given a random vector $X' = [X_1, X_2, \ldots, X_p]$ with covariance matrix $\Sigma$ whose eigenvalues satisfy $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, consider the linear combinations $Y_1 = a_1'X, \ldots, Y_p = a_p'X$ for coefficient vectors $a_1, \ldots, a_p$. Then $\mathrm{Var}[Y_i] = a_i'\Sigma a_i$ and $\mathrm{Cov}[Y_i, Y_j] = a_i'\Sigma a_j$.
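As a quick numerical check of these formulas, here is a sketch that substitutes the sample covariance $S$ for the population $\Sigma$ and takes the $a_i$ to be its eigenvectors (the data and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated sample

# Sample covariance matrix, standing in for Sigma
S = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; flip so lambda_1 >= ... >= lambda_p
lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]

# Principal components Y_i = a_i' X (the a_i are the columns of A)
Y = (X - X.mean(axis=0)) @ A

# Check: Var[Y_i] = a_i' S a_i = lambda_i and Cov[Y_i, Y_j] = 0 for i != j
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam)))
```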